tensor computation
Reviews: Singleshot: a scalable Tucker tensor decomposition
The paper proposes efficient methods for computing the Tucker decomposition of higher-order tensors. This is a hard, fundamental problem in numerical linear algebra with reasonably wide applicability. Tensor decompositions have played an important role in a variety of machine learning applications; see, for example: Anandkumar et al., "Tensor Decompositions for Learning Latent Variable Models", JMLR 2014; Novikov et al., "Tensorizing Neural Networks", NeurIPS 2015, which used tensor decompositions to massively compress the dense layers of VGG; Moitra and Wein, "Spectral Methods from Tensor Networks"; and Malik and Becker, "Low-Rank Tucker Decomposition of Large Tensors Using TensorSketch", NeurIPS 2018. Singleshot is a coordinate-descent-based algorithm that cycles over the variables of the Tucker decomposition and applies a gradient update to each in turn. The paper carefully considers the memory usage of Singleshot (and its variants), since tensor computations are often extremely memory intensive.
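To make the idea of cycling gradient updates over the Tucker variables concrete, here is a minimal sketch for a 3-way tensor. It is not the Singleshot algorithm from the paper: the step size, the tiny problem size, the least-squares loss, and every function name (`tucker_entry`, `sweep`, `random_model`, and so on) are assumptions chosen for illustration. The model is X[i][j][k] ≈ sum over a, b, c of G[a][b][c] * A[i][a] * B[j][b] * C[k][c], and one sweep takes a gradient step on A, B, C, and the core G in turn.

```python
import itertools
import random


def core_indices(core):
    """All (a, b, c) index triples of a 3-way core stored as nested lists."""
    return itertools.product(
        range(len(core)), range(len(core[0])), range(len(core[0][0])))


def tucker_entry(core, factors, idx):
    """Entry (i, j, k) of the Tucker model built from the core and three factors."""
    A, B, C = factors
    i, j, k = idx
    return sum(core[a][b][c] * A[i][a] * B[j][b] * C[k][c]
               for a, b, c in core_indices(core))


def residual(X, core, factors):
    """Model minus data for every stored entry of X (a dict keyed by index)."""
    return {idx: tucker_entry(core, factors, idx) - x for idx, x in X.items()}


def loss(X, core, factors):
    """Least-squares objective 0.5 * ||model - X||^2 (an assumed choice)."""
    return 0.5 * sum(r * r for r in residual(X, core, factors).values())


def sweep(X, core, factors, step=0.01):
    """One cycle: a gradient step on each factor matrix, then on the core."""
    for n in range(3):
        R = residual(X, core, factors)
        grad = [[0.0] * len(row) for row in factors[n]]
        for idx, r in R.items():
            for rk in core_indices(core):
                coef = core[rk[0]][rk[1]][rk[2]]
                for m in range(3):
                    if m != n:  # product over the other two factors
                        coef *= factors[m][idx[m]][rk[m]]
                grad[idx[n]][rk[n]] += r * coef
        for row, grad_row in zip(factors[n], grad):
            for q in range(len(row)):
                row[q] -= step * grad_row[q]
    R = residual(X, core, factors)
    grad = {rk: 0.0 for rk in core_indices(core)}
    for idx, r in R.items():
        for rk in grad:
            grad[rk] += (r * factors[0][idx[0]][rk[0]]
                         * factors[1][idx[1]][rk[1]]
                         * factors[2][idx[2]][rk[2]])
    for (a, b, c), g in grad.items():
        core[a][b][c] -= step * g


def random_model(dims, ranks, rng):
    """Random core and factor matrices with entries in [-1, 1]."""
    factors = [[[rng.uniform(-1, 1) for _ in range(r)] for _ in range(d)]
               for d, r in zip(dims, ranks)]
    core = [[[rng.uniform(-1, 1) for _ in range(ranks[2])]
             for _ in range(ranks[1])] for _ in range(ranks[0])]
    return core, factors


# Synthetic data: a 4 x 3 x 2 tensor that has an exact rank-(2, 2, 2) Tucker form.
rng = random.Random(0)
dims, ranks = (4, 3, 2), (2, 2, 2)
true_core, true_factors = random_model(dims, ranks, rng)
X = {idx: tucker_entry(true_core, true_factors, idx)
     for idx in itertools.product(*map(range, dims))}

core, factors = random_model(dims, ranks, rng)
loss_before = loss(X, core, factors)
for _ in range(100):
    sweep(X, core, factors)
loss_after = loss(X, core, factors)
print(loss_before, loss_after)
```

Only the residual and one gradient block are held at a time, which hints at why memory usage is a natural concern for this family of methods. A practical implementation would use vectorized array operations rather than Python loops.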
TensorIR: An Abstraction for Automatic Tensorized Program Optimization
Feng, Siyuan, Hou, Bohan, Jin, Hongyi, Lin, Wuwei, Shao, Junru, Lai, Ruihang, Ye, Zihao, Zheng, Lianmin, Yu, Cody Hao, Yu, Yong, Chen, Tianqi
Deploying deep learning models on various devices has become an important topic. The wave of hardware specialization brings a diverse set of acceleration primitives for multi-dimensional tensor computations. These new acceleration primitives, along with emerging machine learning models, bring tremendous engineering challenges. In this paper, we present TensorIR, a compiler abstraction for optimizing programs with these tensor computation primitives. TensorIR generalizes the loop nest representation used in existing machine learning compilers to make tensor computation a first-class citizen. Finally, we build an end-to-end framework on top of our abstraction to automatically optimize deep learning models for given tensor computation primitives. Experimental results show that TensorIR compilation automatically uses the tensor computation primitives for given hardware backends and delivers performance that is competitive with state-of-the-art hand-optimized systems across platforms.
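As a rough illustration of what it means to map a loop nest onto a tensor computation primitive, the sketch below writes a matrix multiplication twice: once as a scalar loop nest, and once with the inner loops replaced by calls to a tile-level primitive. This is not TensorIR's API. `mma_tile`, `TILE`, and both matmul functions are hypothetical names, and the primitive is emulated in plain Python where real hardware would expose a single instruction or intrinsic.

```python
TILE = 2  # hypothetical tile size of the accelerator primitive


def mma_tile(A, B, C, i0, j0, k0):
    """Stand-in for a hardware primitive: C_tile += A_tile @ B_tile on TILE x TILE tiles."""
    for i in range(i0, i0 + TILE):
        for j in range(j0, j0 + TILE):
            acc = C[i][j]
            for k in range(k0, k0 + TILE):
                acc += A[i][k] * B[k][j]
            C[i][j] = acc


def matmul_scalar(A, B):
    """Plain loop nest over scalar multiply-adds."""
    n, p, m = len(A), len(B), len(B[0])
    C = [[0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            for k in range(p):
                C[i][j] += A[i][k] * B[k][j]
    return C


def matmul_tensorized(A, B):
    """Same computation, with the inner loops delegated to the tile primitive."""
    n, p, m = len(A), len(B), len(B[0])
    # This sketch assumes every dimension is a multiple of TILE.
    assert n % TILE == 0 and p % TILE == 0 and m % TILE == 0
    C = [[0] * m for _ in range(n)]
    for i0 in range(0, n, TILE):
        for j0 in range(0, m, TILE):
            for k0 in range(0, p, TILE):
                mma_tile(A, B, C, i0, j0, k0)
    return C


A = [[4 * i + j for j in range(4)] for i in range(4)]
B = [[(i + 2 * j) % 5 for j in range(4)] for i in range(4)]
print(matmul_tensorized(A, B) == matmul_scalar(A, B))
```

The outer loops stay ordinary loops that a compiler can reorder or parallelize, while the tile call is an opaque unit whose shape is fixed by the hardware.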
HASCO: Towards Agile HArdware and Software CO-design for Tensor Computation
Xiao, Qingcheng, Zheng, Size, Wu, Bingzhe, Xu, Pengcheng, Qian, Xuehai, Liang, Yun
Tensor computations overwhelm traditional general-purpose computing devices because of the large amounts of data and operations they involve. They call for a holistic solution composed of both hardware acceleration and software mapping. Hardware/software (HW/SW) co-design optimizes the hardware and software in concert and produces high-quality solutions. There are two main challenges in the co-design flow. First, multiple methods exist to partition tensor computation, and they have different impacts on performance and energy efficiency. In addition, the hardware part must be implemented by the intrinsic functions of spatial accelerators. It is hard for programmers to identify and analyze the partitioning methods manually. Second, the overall design space composed of HW/SW partitioning, hardware optimization, and software optimization is huge and needs to be explored efficiently. To this end, we propose an agile co-design approach, HASCO, that provides an efficient HW/SW solution to dense tensor computation. We use tensor syntax trees as the unified IR, based on which we develop a two-step approach to identify partitioning methods. For each method, HASCO explores the hardware and software design spaces. We propose different algorithms for the explorations, as they have distinct objectives and evaluation costs. Concretely, we develop a multi-objective Bayesian optimization algorithm to explore hardware optimization. For software optimization, we use heuristic and Q-learning algorithms. Experiments demonstrate that HASCO achieves a 1.25X to 1.44X latency reduction through HW/SW co-design compared with developing the hardware and software separately.
Supercharge Your Shallow ML Models With Hummingbird
Since the most recent resurgence of deep learning in 2012, a host of new ML libraries and frameworks have been created. The ones that have stood the test of time (PyTorch, TensorFlow, ONNX, etc.) are backed by massive corporations, and likely aren't going away anytime soon. This also presents a problem, however, as the deep learning community has diverged from popular traditional ML software libraries like scikit-learn, XGBoost, and LightGBM. When it comes time for companies to bring multiple models with different software and hardware assumptions into production, things get…hairy. Using microservices in Kubernetes can solve the design pattern issue to an extent by keeping things decoupled…if that's even what you want?
What is PyTorch and how does it work? Packt Hub
PyTorch is a Python-based scientific computing package that uses the power of graphics processing units. It is also one of the preferred deep learning research platforms, built to provide maximum flexibility and speed. It is known for two high-level features: tensor computation with strong GPU acceleration, and deep neural networks built on a tape-based autograd system. There are many existing Python libraries which have the potential to change how deep learning and artificial intelligence are performed, and this is one such library. One of the key reasons behind PyTorch's success is that it is completely Pythonic, so one can build neural network models with little effort.
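For readers wondering what "tape-based autograd" means, the toy below records each operation on a list (the tape) during the forward pass and replays the list in reverse to accumulate gradients. It is a sketch of the general technique on scalars only, not PyTorch's implementation. The `Tape` and `Var` classes and the `backward` function are made-up names for illustration.

```python
class Tape:
    """Records one backward closure per operation, in execution order."""

    def __init__(self):
        self.ops = []


class Var:
    """A scalar value that logs the operations applied to it on a shared tape."""

    def __init__(self, value, tape):
        self.value = value
        self.grad = 0.0
        self.tape = tape

    def __add__(self, other):
        out = Var(self.value + other.value, self.tape)

        def backward_step():
            # d(out)/d(self) = 1 and d(out)/d(other) = 1
            self.grad += out.grad
            other.grad += out.grad

        self.tape.ops.append(backward_step)
        return out

    def __mul__(self, other):
        out = Var(self.value * other.value, self.tape)

        def backward_step():
            # d(out)/d(self) = other.value and d(out)/d(other) = self.value
            self.grad += other.value * out.grad
            other.grad += self.value * out.grad

        self.tape.ops.append(backward_step)
        return out


def backward(out):
    """Seed d(out)/d(out) = 1, then replay the tape from last op to first."""
    out.grad = 1.0
    for step in reversed(out.tape.ops):
        step()


tape = Tape()
x = Var(3.0, tape)
y = Var(4.0, tape)
z = x * y + x          # z = x*y + x
backward(z)
print(z.value, x.grad, y.grad)  # 15.0 5.0 3.0
```

Because the tape is built while ordinary Python code runs, control flow such as loops and conditionals needs no special handling, which is part of what makes this style feel Pythonic.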